Profiting from Mark-Up: Hyper-Text Annotations for Guided Parsing

نویسندگان

  • Valentin I. Spitkovsky
  • Daniel Jurafsky
  • Hiyan Alshawi
چکیده

We show how web mark-up can be used to improve unsupervised dependency parsing. Starting from raw bracketings of four common HTML tags (anchors, bold, italics and underlines), we refine approximate partial phrase boundaries to yield accurate parsing constraints. Conversion procedures fall out of our linguistic analysis of a newly available million-word hyper-text corpus. We demonstrate that derived constraints aid grammar induction by training Klein and Manning’s Dependency Model with Valence (DMV) on this data set: parsing accuracy on Section 23 (all sentences) of the Wall Street Journal corpus jumps to 50.4%, beating previous state-of-theart by more than 5%. Web-scale experiments show that the DMV, perhaps because it is unlexicalized, does not benefit from orders of magnitude more annotated but noisier data. Our model, trained on a single blog, generalizes to 53.3% accuracy out-of-domain, against the Brown corpus — nearly 10% higher than the previous published best. The fact that web mark-up strongly correlates with syntactic structure may have broad applicability in NLP.

برای دانلود رایگان متن کامل این مقاله و بیش از 32 میلیون مقاله دیگر ابتدا ثبت نام کنید

ثبت نام

اگر عضو سایت هستید لطفا وارد حساب کاربری خود شوید

منابع مشابه

Symbolic Representation of Musical Chords: A Proposed Syntax for Text Annotations

In this paper we propose a text represention for musical chord symbols that is simple and intuitive for musically trained individuals to write and understand, yet highly structured and unambiguous to parse with computer programs. When designing feature extraction algorithms, it is important to have a hand annotated test set providing a ground truth to compare results against. Hand labelling of ...

متن کامل

Discriminative Learning with Natural Annotations: Word Segmentation as a Case Study

Structural information in web text provides natural annotations for NLP problems such as word segmentation and parsing. In this paper we propose a discriminative learning algorithm to take advantage of the linguistic knowledge in large amounts of natural annotations on the Internet. It utilizes the Internet as an external corpus with massive (although slight and sparse) natural annotations, and...

متن کامل

Large-scale Semantic Parsing without Question-Answer Pairs

In this paper we introduce a novel semantic parsing approach to query Freebase in natural language without requiring manual annotations or question-answer pairs. Our key insight is to represent natural language via semantic graphs whose topology shares many commonalities with Freebase. Given this representation, we conceptualize semantic parsing as a graph matching problem. Our model converts s...

متن کامل

VnCoreNLP: A Vietnamese Natural Language Processing Toolkit

We present an easy-to-use and fast toolkit, namely VnCoreNLP—a Java NLP annotation pipeline for Vietnamese. Our VnCoreNLP supports key natural language processing (NLP) tasks including word segmentation, part-of-speech (POS) tagging, named entity recognition (NER) and dependency parsing, and obtains state-of-the-art (SOTA) results for these tasks. We release VnCoreNLP to provide rich linguistic...

متن کامل

Syntactic parsing with NooJ

When parsing a text, NooJ’s parsers store all the annotations that they produce in the Text’s Annotation Structure (TAS). At each level of the various linguistic analyses and the corresponding parser, a given parser may add annotations to, or remove annotations from, the TAS. As annotations are attached to larger and larger sequences of texts, the TAS represents the hierarchical structure of th...

متن کامل

ذخیره در منابع من


  با ذخیره ی این منبع در منابع من، دسترسی به آن را برای استفاده های بعدی آسان تر کنید

عنوان ژورنال:

دوره   شماره 

صفحات  -

تاریخ انتشار 2010